
BMC Genomics

Springer Science and Business Media LLC

Preprints posted in the last 7 days, ranked by how well they match BMC Genomics's content profile, based on 328 papers previously published here. The average preprint has a 0.19% match score for this journal, so anything above that is already an above-average fit.

1
Systematic evaluation of 24 extraction and library preparation combinations for metagenomic sequencing of SARS-CoV-2 in saliva

Qian, K.; Abhyankar, V.; Keo, D.; Zarceno, P.; Toy, T.; Eskin, E.; Arboleda, V. A.

2026-04-20 genomics 10.64898/2026.04.16.719115 medRxiv
Top 0.2%
6.9%

Sequencing the respiratory tract transcriptome has the potential to provide insights into infectious pathogens and the host's immune response. While DNA-based sequencing is more standard in clinical laboratories due to its stability, RNA assays offer unique advantages. RNA reflects dynamic physiological changes, and for RNA viruses, viral RNA particles directly represent copies of the viral genome, enabling greater diagnostic sensitivity. However, RNA's susceptibility to degradation remains a significant challenge, particularly in RNase-rich specimens like saliva. To address this, we conducted a systematic, combinatorial evaluation of 24 distinct mNGS workflows, crossing eight nucleic acid extraction methods with three RNA-Seq library preparation protocols. Remnant saliva samples (n = 6) were pooled and spiked with MS2 phage as a control. The SARS-CoV-2 virus was spiked into half of the samples, which were extracted using the eight different extraction methods (n = 3) and compared using RNA Integrity Number equivalent (RINe) scores and RNA concentration. The extracted RNA was then processed across the three library construction methods and subjected to short-read sequencing to assess all 24 combinations head-to-head. We compared methods based on viral read recovery and found that RINe and concentration did not correlate with viral detection. The Zymo Quick-RNA Magbead kit and the Tecan Revelo RNA-Seq High-Sensitivity RNA library kit were the extraction and library-preparation kits that yielded the most SARS-CoV-2 reads, respectively. Importantly, our combinatorial analysis revealed that any small variability attributable to different nucleic acid extraction methods was heavily overshadowed by differences in quality attributable to the RNA-Seq library preparation methods. These findings challenge the reliance on conventional RNA quality metrics for clinical metagenomics and underscore the need to redefine extraction quality standards for mNGS applications.
IMPORTANCE: mNGS is a powerful and unbiased approach to pathogen detection that has mostly been applied to blood and cerebrospinal fluid samples. However, mNGS has recently been applied to more areas, including respiratory pathogen detection, with potential applications in both in-patient diagnostics and public health surveillance. Saliva samples are an ideal sample type for these use cases since they can be collected non-invasively. However, saliva is also a challenging sample type because of its high RNase activity, and it often yields low-quality nucleic acid. This study explores the feasibility of using saliva specimens in mNGS with contrived SARS-CoV-2 samples to optimize the combination of two factors: nucleic acid extraction and RNA-seq library preparation. Exploration in this area could enhance the sensitivity of saliva-based mNGS assays, with the goal of future expansion of this specimen type in clinical diagnostics and public health surveillance.
Key Points:
- The choice of RNA-Seq library preparation kit has a greater impact on pathogen detection than the nucleic acid extraction method.
- The combination of the Zymo Quick-RNA Magbead extraction kit and the Tecan Revelo RNA-Seq High-Sensitivity RNA library kit recovered the highest percentage of total SARS-CoV-2 reads.
- RNA quantity and RINe score do not correlate with viral read capture, indicating a need for an alternative metric to assess RNA quality for downstream mNGS clinical diagnostics.

2
REPLAY: A reproducible and user-friendly application for DNA replication timing analysis from Repli-seq data

Dickinson, Q.; Yu, C.; Rivera-Mulia, J. C.

2026-04-21 genomics 10.64898/2026.04.16.719037 medRxiv
Top 0.5%
4.7%

Background: DNA replication timing (RT) is a fundamental feature of genome organization that is regulated in a cell-type-specific manner and frequently altered in disease. Repli-seq is the standard approach for genome-wide RT profiling; however, its analysis typically requires multiple independent tools and custom scripts, limiting reproducibility, portability, and accessibility, particularly for users without computational expertise. In addition, existing workflows often lack standardization and require substantial user intervention. Results: We developed REPLAY, a fully automated, reproducible, and user-friendly application for replication timing analysis. REPLAY is distributed as a standalone executable that enables end-to-end processing from compressed FASTQ files to genome-wide RT profiles without requiring software installation or programming experience. Through an intuitive graphical interface, users can configure analysis parameters, including input and output directories, reference genome, normalization strategy (quantile, median, or interquartile range), and smoothing. The application integrates all processing steps (quality control, trimming, alignment, binning, RT log2 calculation, normalization, smoothing, and visualization) within a single automated workflow. Application of REPLAY to publicly available datasets demonstrates accurate reconstruction of RT profiles and high reproducibility across samples. Conclusions: REPLAY offers a portable, reproducible, and accessible solution for the analysis of RT data. By eliminating the need for command-line tools and complex installations, it lowers the entry barrier, enabling standardized analysis across diverse research settings.
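
To make the "RT log2 calculation" and "median" normalization steps concrete, here is a minimal Python sketch. It assumes the common two-fraction Repli-seq design (per-bin read counts from early and late S-phase fractions), depth scaling to reads per million, and a pseudocount of 1. The function names and these choices are illustrative assumptions, not REPLAY's actual implementation.

```python
import math
import statistics


def rt_log2(early_counts, late_counts, pseudocount=1.0):
    """Per-bin log2(early / late) after scaling each fraction to reads per million.

    The pseudocount (an assumption of this sketch) avoids division by zero
    and log of zero in empty bins.
    """
    early_total = sum(early_counts)
    late_total = sum(late_counts)
    ratios = []
    for early, late in zip(early_counts, late_counts):
        early_rpm = (early + pseudocount) / early_total * 1e6
        late_rpm = (late + pseudocount) / late_total * 1e6
        ratios.append(math.log2(early_rpm / late_rpm))
    return ratios


def median_center(values):
    """Median normalization: shift the profile so that its median is zero."""
    med = statistics.median(values)
    return [v - med for v in values]


# Toy example: three genomic bins, equal sequencing depth in both fractions.
profile = median_center(rt_log2([10, 20, 30], [30, 20, 10]))
```

Positive values in `profile` would mark bins enriched in the early fraction and negative values bins enriched in the late fraction; quantile or interquartile-range normalization would replace `median_center` in the same position of the workflow.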

3
A Seychelles warbler genomic toolkit

Lee, K. G. L.; Bartleet-Cross, C.; Gonzalez-Mollinedo, S.; Dong, S.; Pinto, A.; Lee, C. Z.; Sparks, A.; van de Velde, M.; Manarelli, M.-E.; Holden, T.; Tucker, R.; Maher, K. H.; Hipperson, H.; Slate, J.; Komdeur, J.; Richardson, D.; Dugdale, H.; Burke, T.

2026-04-21 genomics 10.64898/2026.04.16.719046 medRxiv
Top 0.5%
4.7%

Understanding evolutionary processes is greatly facilitated by high-quality data on genetic variation. We report the development of a genomic toolkit for a recently bottlenecked, long-term studied species, the Seychelles warbler (Ptimerl dezil; Acrocephalus sechellensis). This toolkit comprises a reference genome assembled into 31 chromosomes, together with functional annotations and reference-panel-free imputation of whole-genome sequences from 1,935 individuals. The genomic data have been used to assign the sequenced individuals into a genetic pedigree. Individual genomic data are associated with a suite of phenotypic metadata, amassed from three decades of fieldwork in this closed, long-term monitored population. We compared sex and parentage assigned using the genomic data with the previously recorded sex and parentage metadata to identify and correct 41 DNA samples labelled with the wrong identity. This population resource enables a wide range of analyses that include, but are not limited to, phylogenetics, metabarcoding, recombination rates, linkage patterns, adaptation, heritability, demographic history, selection, and inbreeding estimates. We wish to encourage interest from researchers seeking to collaborate on future analyses and data collection. Overall, our methods demonstrate the potential of next-generation sequencing and statistical tools to provide dense genomic datasets at large sample sizes for wild populations.

4
MutaPhy: A clade-based framework to detect genotype-phenotype associations on phylogenetic trees

Ngo, A.; Guindon, S.; Pedergnana, V.

2026-04-21 evolutionary biology 10.64898/2026.04.19.719535 medRxiv
Top 1%
3.1%

Understanding how genetic variation in pathogens influences clinical phenotypes observed in infected hosts is a fundamental challenge in evolutionary genomics and public health. Phenotypic traits such as infection severity are often non-randomly distributed within the pathogen's phylogeny, suggesting the existence of evolutionary determinants but also violating the independence assumption underlying classical genome-wide association studies and potentially leading to inflated false positive rates. We present MutaPhy, a phylogeny-based method aimed at detecting correlations between a binary host phenotype and the corresponding pathogen genome by directly utilizing the hierarchical structure of phylogenetic trees. MutaPhy encompasses three different scales: (i) a subtree scale, on which relevant clades over-representing the phenotype of interest are detected using permutation-based tests; (ii) a tree scale, which agglomerates local signals into a global association statistic; and (iii) a site scale, whereby candidate mutational events on branches leading to significant clades are examined using ancestral sequence reconstruction. We evaluate the statistical behavior and detection performance of MutaPhy using simulations under diverse evolutionary scenarios. We also compare this tool to several existing phylogenetic association methods. As illustrative applications, we apply MutaPhy to dengue virus and hepatitis C virus datasets associated with clinical phenotypes in human hosts. Our results highlight the ability of the proposed approach to detect viral lineages associated with over-represented phenotypes while revealing limited evidence for robust mutation-level associations in these particular datasets. Altogether, MutaPhy provides a framework for guiding genotype-phenotype association analyses by leveraging phylogenetic structure, thereby reducing false positive findings and improving the interpretability of association signals.
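
The subtree-scale idea, testing whether a clade over-represents a binary phenotype by permutation, can be sketched in a few lines of Python. This is a generic label-shuffling test under stated assumptions: the clade is given as a set of tip names, the statistic is the count of phenotype-positive tips in the clade, and the null shuffles phenotype labels uniformly across all tips. The function name and these choices are hypothetical; MutaPhy's actual statistic and null model may differ.

```python
import random


def clade_permutation_pvalue(phenotypes, clade_tips, n_perm=10000, seed=0):
    """One-sided permutation p-value for over-representation of phenotype 1 in a clade.

    phenotypes: dict mapping tip name -> 0 or 1
    clade_tips: set of tip names belonging to the clade under test
    """
    rng = random.Random(seed)
    tips = sorted(phenotypes)
    labels = [phenotypes[t] for t in tips]
    in_clade = [t in clade_tips for t in tips]

    # Observed statistic: number of phenotype-positive tips inside the clade.
    observed = sum(lab for lab, inside in zip(labels, in_clade) if inside)

    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        stat = sum(lab for lab, inside in zip(shuffled, in_clade) if inside)
        if stat >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

In practice such a test would be repeated over many candidate clades, so the resulting p-values would need a multiple-testing correction before being aggregated into a tree-level statistic.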

5
International Adaptation of a brief Problem-Solving Skills (the IAPPS trial) training for people in custody with severe mental illness in Poland: an open multicentred, parallel group, feasibility randomised controlled trial.

Perry, A. E.; Zawadzka, M.; Rychlik, J.; Hewitt, C.

2026-04-25 forensic medicine 10.64898/2026.04.24.26351654 medRxiv
Top 1%
2.2%

Objectives: The primary aim of this study was to assess the feasibility of delivering an adapted problem-solving skills (PSS) intervention by quantifying the recruitment, follow-up and completion rates using a brief problem-solving intervention for people with a mental health diagnosis in two Polish prisons. Design: IAPPS is an open, multi-centred, parallel group feasibility randomised controlled trial (RCT). Setting: Two prisons in Poland. Participants: Men in custody aged 18 years and older, having a mental illness and living within the prison therapeutic unit. Interventions: The intervention consisted of an adapted PSS intervention plus care as usual (CAU) or care as usual only. It was delivered in groups of up to five people in 1.5-hour sessions over the course of two weeks. Main outcome measures: Primary outcomes were the rate of recruitment, follow-up, and feasibility of delivering the intervention. Secondary outcomes included measures of depression, general mental health, and coping strategies. Results: 129 male prisoners were screened, 64 were randomly allocated, with a mean age of 53.5 years (SD 14, range 23-84). 59 (95%) prisoners were of Polish origin. Our recruitment rate was 48%. There was differential follow-up, with those in the intervention group less likely to complete the post-test battery than those who received care as usual. Outcome measures were successfully collected at both time points. Conclusions: We were able to recruit, retain and deliver the intervention within the prison setting; some logistical challenges limited our assessment of intervention engagement. Our data help to demonstrate how the RCT study design can be implemented and delivered within the complex prison environment. Trial registration number: ISRCTN 70138247; protocol registration date: May 2021.

6
Comparative analysis of transposable elements in jellyfish and hydroid species (Cnidaria: Medusozoa)

Mays, A.; Cabrera, F.; Macias-Munoz, A.

2026-04-21 evolutionary biology 10.64898/2026.04.17.719288 medRxiv
Top 2%
2.1%

Background: Transposable elements (TEs) are repetitive genetic elements that can jump to new loci, causing genome expansions and structural rearrangements and ultimately propelling the evolution of genomes. Despite their significance, the role of TEs in the evolution of genomes and phylogenetic groups remains largely understudied in early diverging lineages. Further, the extent to which TE content varies across species is still an open question. Medusozoa, a group within Cnidaria encompassing jellyfish and hydroids, exhibits an exceptional diversity of life history strategies, body plans, and physiological capabilities. These characteristics, along with its early-diverging phylogenetic position, establish Medusozoa as an ideal system for investigating the composition and evolutionary history of TEs within the group. Results: We generated a custom repeat library built from annotations of 25 Medusozoan genomes and used it to characterize TEs, aiming to identify lineage-specific TE content and activity that may correlate with the diversity observed within the group. We found that repetitive element percentage and genome size varied considerably, with Hydrozoa exhibiting the most variation among classes in both respects. DNA transposons were the most prevalent TE classification in all but two genomes, averaging 28% of all genomes. Intra-genus comparisons revealed a surprising degree of difference in TE content. In the genus Aurelia, the expansion of a single DNA transposon superfamily accounted for much of the difference in repetitive element percentage between two species, whereas in the genus Turritopsis, a similar divergence resulted from the proliferation of multiple superfamilies. Interestingly, most genomes showed evidence of recent TE expansions, suggesting ongoing activity in many medusozoan species. Conclusion: We present the first comparative analysis of TEs across all medusozoan classes. Our results reveal class-specific TE dynamics and highlight cases of TE proliferation as lineages diverge. This research provides data on TE activity and diversity that can be used as a resource for future study and fills important gaps in our understanding of TEs in early diverging animal lineages.

7
Closely related, yet phenotypically different - Genome assemblies of two sister species of widow spiders: Latrodectus hasselti and L. katipo, Theridiidae

Ivanov, V.; Uludag, K. O.; Schöneberg, Y.; Schneider, J. M.; Kennedy, S.; Hamadou, A. B.; Vink, C. J.; Krehenwinkel, H.

2026-04-21 genomics 10.64898/2026.04.17.719154 medRxiv
Top 2%
2.1%

Widow spiders of the genus Latrodectus are important animals for biomedical, pest and conservation research. Here, we present the assembled genomes of two closely related Latrodectus species: the Australian L. hasselti and the New Zealand endemic L. katipo. The genome of L. katipo consists of 13 scaffolds likely corresponding to chromosomes (90% of the total length) and 1,267 short scaffolds (10%). It has a total length of 1.5 Gbp and a BUSCO score of 94.9%. The genome of L. hasselti consists of 379 scaffolds and has a total length of 1.7 Gbp and a BUSCO score of 95.4%. The repeat content is very similar in both genomes, with a total proportion of 37.2% for L. katipo and 39.9% for L. hasselti. Genome annotation predicted 12,706 and 15,111 genes for L. katipo and L. hasselti, respectively. An ortholog analysis shows a large overlap between orthogroups, suggesting either duplication events in L. hasselti or loss of genes in L. katipo.

8
DNAharvester: A Nextflow Pipeline for Analysing Highly Degraded DNA from Ancient and Historical Specimens

Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; Heintzman, P. D.; Dalen, L.

2026-04-21 bioinformatics 10.64898/2026.04.20.719564 medRxiv
Top 2%
1.9%

Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.

9
Comparative fine-mapping of breast cancer susceptibility loci using summary statistics methods and multinomial regression

O'Mahony, D. G.; Beasley, J.; Zanti, M.; Dennis, J.; Dutta, D.; Kraft, P.; Kristensen, V.; Chenevix-Trench, G.; Easton, D. F.; Michailidou, K.

2026-04-22 epidemiology 10.64898/2026.04.21.26351364 medRxiv
Top 2%
1.9%

Summary statistics fine-mapping methods offer advantages over classical methods, including avoiding data-sharing constraints and improved modelling of correlated variables and sparse effects. However, their performance has not been comprehensively evaluated in breast cancer using real-world data. Previous multinomial stepwise regression (MNR) fine-mapping analyses for breast cancer identified 196 credible sets. Here, we apply summary statistics fine-mapping, compare methods, and assess parameters influencing performance. Using summary statistics from the Breast Cancer Association Consortium, we compared finiMOM, SuSiE, and FINEMAP to published MNR results across 129 regions. Performance was assessed by recall using in-sample and out-of-sample LD. Discordant credible sets were examined for technical factors, and target genes were defined using the INQUISIT pipeline. SuSiE showed the closest agreement with MNR. Results varied across regions depending on the assumed number of causal variants (L), with higher values reducing recall and no single L maximising performance. At optimal L per region, SuSiE identified 8,192 CCVs in 244 credible sets, with recall of 88%, 86%, and 72% for overall, ER-positive, and ER-negative breast cancer. Thirty MNR sets were missed. Discordance was partially explained by allele flips, imputation quality, and array heterogeneity. Fifty-two MNR-identified genes, including BRCA2, WNT7B and CREBBP, were not recovered, while additional candidate genes were identified. Using out-of-sample LD reduced recall by 3% but identified novel variants. Fine-mapping results vary across methods, and no single approach is sufficient. The choice of L strongly influences results, and combining analytical approaches with functional validation can improve causal variant identification.
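
The abstract reports recall against the published MNR credible sets without defining it. One plausible set-level definition, used here purely as an assumption, counts a reference credible set as recovered when at least one of its variants appears in any credible set from the method being evaluated. The Python sketch below implements that definition; the function name and the variant identifiers in the example are hypothetical.

```python
def credible_set_recall(reference_sets, candidate_sets):
    """Fraction of reference credible sets that share a variant with any candidate set.

    reference_sets, candidate_sets: iterables of sets of variant identifiers.
    """
    reference_sets = list(reference_sets)
    if not reference_sets:
        raise ValueError("recall is undefined without reference credible sets")

    # Pool every variant reported by the method under evaluation.
    candidate_variants = set().union(*candidate_sets)

    recovered = sum(1 for ref in reference_sets if ref & candidate_variants)
    return recovered / len(reference_sets)


# Toy example with made-up identifiers: two of three reference sets are recovered.
reference = [{"var1", "var2"}, {"var3"}, {"var4"}]
candidates = [{"var2", "var9"}, {"var4"}]
recall = credible_set_recall(reference, candidates)
```

A variant-level definition (the share of reference variants recovered) would give different numbers, so any comparison across studies would need to state which definition is in use.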

10
Tracking and predicting the dynamics of HIV-1 epidemics in France using virus genomic data

Colliot, L.; Garrot, V.; Petit, P.; Zhukova, A.; Chaix, M.-L.; Mayer, L.; Alizon, S.

2026-04-24 epidemiology 10.64898/2026.04.21.26351380 medRxiv
Top 3%
1.5%

Understanding the dynamics of HIV epidemics is important to control them effectively. Classical methods that mainly rely on occurrence data are limited by the fact that an unknown part of the epidemic eludes sampling. Since the early 2000s, phylodynamic methods have enabled the estimation of key epidemiological parameters from virus genetic sequence data. These methods have the advantage of being less sensitive to partial sampling and of providing insights into epidemic history that predates even the first samples. In this study, we analysed 2,205 HIV sequences from the French ANRS PRIMO C06 cohort. We identified and were able to reconstruct the temporal dynamics of two large clades that represent the HIV-1 epidemics in the country. Using Bayesian phylodynamic inference models, we found that the first clade, from subtype B, originated at the end of the 1970s, grew rapidly during the 1980s before decreasing from 2000 to 2015, and has stagnated since then. The second clade, from circulating recombinant form CRF02_AG, emerged and spread in the 1980s and grew again in the early 2000s before declining slightly. We also estimated key epidemiological parameters associated with each clade. Finally, using numerical simulations, we investigated prospective scenarios and assessed the possibility of meeting the 2030 UNAIDS targets. This is one of the rare studies to analyse the HIV epidemic in France using molecular epidemiology methods. It highlights the value of routine HIV sequence data for studying past epidemic trends or designing public health policies.

11
A phylogenetic approach reveals evolutionary aspects and novel genes of bradyzoite conversion in Toxoplasma gondii

C A, A.; Upadhayay, R.; Patankar, S. A.

2026-04-21 bioinformatics 10.64898/2026.04.20.719551 medRxiv
Top 3%
1.5%

Toxoplasma gondii is a widespread human pathogen that has multiple, clinically relevant stages in its complex life cycle, including fast-replicating tachyzoites and latent bradyzoites. Bradyzoite differentiation is triggered by stress responses that lead to changes in transcription, translation, and metabolism. Two aspects of this process are addressed in this report: first, whether proteins that play roles in bradyzoite differentiation are specific to T. gondii and other bradyzoite-forming parasites of the Sarcocystidae family, and second, whether new bradyzoite differentiation proteins can be identified in T. gondii. To answer these questions, a phylogenetic approach was used, comparing proteomes of select members of the Sarcocystidae family that form morphologically different bradyzoite cysts and members of the Eimeriidae family that do not form cysts. This approach resulted in 8 distinct clusters of T. gondii proteins that reflected different conservation patterns; for example, one cluster showed conservation among all organisms, while another showed conservation in bradyzoite cyst-forming organisms. Known T. gondii proteins involved in bradyzoite differentiation were found in all clusters, indicating that this process uses both highly conserved pathways as well as bradyzoite-specific pathways. Importantly, the cluster containing proteins that are conserved in bradyzoite-forming organisms contained several known regulators of bradyzoites, and will be a source for identifying novel T. gondii proteins that are involved in bradyzoite differentiation.

12
Pan1c: a pipeline to easily build chromosome-level pangenome graphs

Mergez, A.; Racoupeau, M.; Bardou, P.; Linard, B.; Legeai, F.; Choulet, F.; Gaspin, C.; Klopp, C.

2026-04-21 bioinformatics 10.64898/2026.04.17.719212 medRxiv
Top 3%
1.3%

Advances in sequencing technologies and the availability of high-quality genome assemblies for many genotypes per species offer the opportunity to improve sequence alignment rate and quality, as well as variant calling accuracy, by including all genomic variations in a graph reference, called a pangenome graph. Because the process of building and analysing a pangenome graph is still complex, with related software packages under development, there is an important need for user-friendly pipelines in this emerging research area. Pan1C is a pipeline based on a chromosome-by-chromosome graph construction strategy. It integrates two complementary strategies for building pangenomes and produces informative metric plots and graphics using a large set of tools. By benchmarking Pan1C on human, fungal, and wheat assemblies, which span a wide range of genome sizes and complexities, we demonstrated the value of Pan1C for assembly and graph validation as well as for performing primary analyses.

13
Genome-wide identification and characterization of the NAC transcription factor family in Cynodon dactylon and their expression during abiotic stresses

Poudel, A.; Wu, Y.

2026-04-20 bioinformatics 10.64898/2026.04.15.718725 medRxiv
Top 4%
1.2%

Common bermudagrass (Cynodon dactylon) is a highly resilient and cosmopolitan grass widely used for turf, forage, and soil stabilization. Although its genome has been sequenced, few studies have focused on characterizing genes underlying its resilience, including the NAC transcription factor family, which is well known for its physiological and stress-related functions. This study aimed to systematically characterize NAC TF genes in the bermudagrass genome and assess their potential roles in abiotic stress tolerance. A total of 237 CdNAC genes were identified and phylogenetically classified into 14 groups, including 40 members in the NAM/NAC1 class, which is associated with plant growth and development, and 23 members in the SNAC class, which is associated with stress responses. Tissue-specific RNA-seq analysis indicated that about one-fourth of CdNAC genes were expressed across all tissues, whereas 13 genes showed relatively higher expression in roots and 9 in inflorescence, suggesting both essential and specialized functions. Stress-responsive expression profiling revealed that 35 CdNAC genes were upregulated in response to drought, 43 to heat, 10 to salt, and 42 to submergence stress. Notably, CdNAC122, 149, and 155, members of the SNAC class, were consistently upregulated across all stress conditions, while others exhibited stress-specific expression, such as CdNAC37, 130, 145, and 199 in drought, CdNAC7, 12, 18, and 29 in heat, CdNAC46 and 151 in salt, and CdNAC9 and 31 in submergence. In contrast, 53 genes were downregulated during different stresses, with most belonging to the NAM/NAC1, TERN, or OsNAC7 classes, possibly reflecting suppression of photosynthesis and development-related processes under stress. These results provide the first comprehensive characterization of CdNAC genes, reveal their distinct regulatory roles in abiotic stress responses, and establish a foundation for future functional validation and applications in breeding of stress-resilient bermudagrass.

14
Novel Genetic Risk Loci for Pancreatic Ductal Adenocarcinoma Identified in a Genome-wide Study of African Ancestry Individuals

Vergara, C.; Ni, Z.; Zhong, J.; McKean, D.; Connelly, K. E.; Antwi, S. O.; Arslan, A. A.; Bracci, P. M.; Du, M.; Gallinger, S.; Genkinger, J.; Haiman, C. A.; Hassan, M.; Hung, R. J.; Huff, C.; Kooperberg, C.; Kastrinos, F.; LeMarchand, L.; Lee, W.; Lynch, S. M.; Moore, S. C.; Oberg, A. L.; Park, M. A.; Permuth, J. B.; Risch, H. A.; Scheet, P.; Schwartz, A.; Shu, X.-O.; Stolzenberg-Solomon, R. Z.; Wolpin, B. M.; Zheng, W.; Albanes, D.; Andreotti, G.; Bamlet, W. R.; Beane-Freeman, L.; Berndt, S. I.; Brennan, P.; Buring, J. E.; Cabrera-Castro, N.; Campa, D.; Canzian, F.; Chanock, S. J.; Chen, Y.;

2026-04-22 genetic and genomic medicine 10.64898/2026.04.21.26351329 medRxiv
Top 4%
1.2%

Pancreatic cancer disproportionately affects Black individuals in the United States, but they have limited representation in genetic studies of pancreatic ductal adenocarcinoma (PDAC). To address this gap, we performed admixture mapping and genome-wide association analysis (GWAS) in genetically inferred African ancestry individuals (1,030 cases and 889 controls). Admixture mapping identified three regions with a significantly higher proportion of African ancestry in cases compared to controls (5q33.3, 10p1, 22q12.3). GWAS identified a genome-wide significant association at 5p15.33 (CLPTM1L, rs383009:T>C, T allele frequency = 0.51, OR = 1.45, P value = 1.24 × 10^-8), a locus previously associated with PDAC. Known loci at 5p15.33, 7q32.3, 8q24.21 and 7q25.1 also replicated (P value < 0.01). Multi-ancestral fine-mapping identified two potential causal SNPs (rs3830069 and rs2735940) at 5p15.33. Collectively, these findings identified novel PDAC risk loci and expanded our understanding of this deadly cancer in underrepresented populations, emphasizing the multifactorial nature of PDAC risk, including inherited genetic and non-genetic factors. Statement of Significance: To understand how genetic variation contributes to PDAC risk in Black people in North America, we studied individuals of genetically inferred African ancestry. We identified novel risk loci and differences in the contribution of known loci. This demonstrates that ancestry-informed genetic analyses improve our understanding of PDAC risk and enhance discovery.

15
Network-Based Functional Fragility Reveals System-Level Reorganization Of The Gut Microbiome In Inflammatory Bowel Disease

Kenavdekar, M. V.; Natarajan, E.

2026-04-21 bioinformatics 10.64898/2026.04.16.719113 medRxiv
Top 4%
1.2%

The human gut microbiome plays a critical role in host health, yet its functional organization in disease remains poorly understood. Most studies focus on taxonomic composition or pathway abundance, which fail to capture higher-order interactions governing system-level behavior. Here, we investigated microbiome functional organization in inflammatory bowel disease (IBD), including Crohn's disease (CD), ulcerative colitis (UC), and healthy controls (HC), using a network-based framework across 60 metagenomic samples. Functional pathway profiles were used to construct correlation-based interaction networks, followed by analysis of network topology, functional redundancy, keystone pathway architecture, and system robustness. Disease-associated networks (CD and UC) exhibited reduced global connectivity, increased modular fragmentation, and centralization of keystone pathways, indicating a shift from distributed organization to more fragmented and fragile network structures compared to healthy controls. Notably, machine learning models demonstrated that network-derived features achieved higher classification performance (accuracy up to 0.824) compared to redundancy-based measures. These findings reveal that microbiome dysfunction in IBD is driven by large-scale reorganization of functional interaction networks rather than loss of functional capacity. This study highlights the importance of network-level analysis in understanding microbiome-associated disease and provides a systems-level framework for future research.
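
As a rough illustration of how a correlation-based interaction network can be built from pathway profiles and summarised by a global connectivity measure, consider the Python sketch below. It assumes Pearson correlation across samples, an absolute-correlation threshold of 0.8 for drawing an edge, and edge density as the connectivity summary. These are illustrative assumptions, as are the pathway names; the study's actual correlation measure, threshold, and topology metrics are not given in the abstract.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_network(profiles, threshold=0.8):
    """Return the set of pathway pairs whose |correlation| reaches the threshold.

    profiles: dict mapping pathway name -> list of abundances, one per sample.
    """
    edges = set()
    for a, b in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[a], profiles[b])) >= threshold:
            edges.add((a, b))
    return edges


def edge_density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0


# Toy profiles for three hypothetical pathways measured in four samples.
profiles = {"pwyA": [1, 2, 3, 4], "pwyB": [2, 4, 6, 8], "pwyC": [1, -1, 1, -1]}
edges = correlation_network(profiles)
density = edge_density(len(profiles), edges)
```

Comparing `density` between networks built separately for each group is one simple way to express a statement such as "reduced global connectivity"; modularity and centrality analyses would require a graph library on top of the same edge list.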

16
GNOMES: an integrated framework for genome-wide normalization and differential binding analysis of CUT&RUN and ChIP-seq data

Roule, T.; Akizu, N.

2026-04-21 bioinformatics 10.64898/2026.04.16.718722 medRxiv
Top 4%
1.0%

Background: Despite their widespread use, epigenomic datasets such as ChIP-seq and CUT&RUN remain challenging to compare quantitatively, particularly because of difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient because signal-to-noise ratios vary widely across samples, even within the same experiment. While exogenous spike-in normalization can address some of these issues, robust spike-in controls are not always available and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding analysis are typically performed using separate bioinformatics tools. Indeed, most differential analysis frameworks operate on raw count matrices, preventing users from visually inspecting normalized signal tracks and evaluating how normalization influences the results. To overcome these challenges, we developed GNOMES (Genome-wide NOrmalization of Mapped Epigenomic Signals), a framework that integrates signal normalization, quality control, and differential binding analysis within a unified workflow.

Results: GNOMES is a user-friendly tool that processes ChIP-seq and CUT&RUN datasets from aligned reads and generates normalized coverage profiles and differential binding results. The tool implements a robust genome-wide normalization strategy based on percentile scaling of signal local maxima, enabling stable normalization between biological replicates and conditions. GNOMES supports both single- and paired-end sequencing, does not require a negative control (input or IgG), and can be applied to both broad (histone marks) and narrow (transcription factor) enrichment patterns. The workflow includes normalization, optional consensus peak identification, and differential binding analysis. For each step, GNOMES generates extensive quality-control metrics and visual outputs, including normalized bigWig tracks, median signal tracks, BED files of regions with significant changes, and diagnostic plots such as heatmaps and PCA. GNOMES is highly configurable and integrates established tools such as MACS2 for identifying candidate peak regions for differential binding analysis, as well as DESeq2 and edgeR for statistical testing. Finally, GNOMES is organism-agnostic and can be applied to epigenomic datasets from any model system.

Conclusions: GNOMES provides an integrated and highly customizable environment for normalization and differential binding analysis of epigenomic sequencing data. By integrating signal normalization with downstream statistical testing for differential binding and comprehensive quality control, GNOMES simplifies the analysis of ChIP-seq and CUT&RUN datasets for the identification of chromatin changes.
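The phrase "percentile scaling of signal local maxima" can be illustrated with a small sketch. This is an interpretation under stated assumptions, not the GNOMES implementation: the function names, the nearest-rank percentile, the default of the 99th percentile, and the use of a plain list of binned coverage values are all hypothetical choices made for illustration.

```python
import math


def local_maxima(signal):
    """Return the values of bins that are higher than their left neighbour
    and at least as high as their right neighbour."""
    return [
        signal[i]
        for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
    ]


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def scale_by_maxima_percentile(signal, q=99):
    """Divide a coverage track by the q-th percentile of its local maxima.

    The choice of q is an assumption; the abstract does not state it.
    """
    peaks = local_maxima(signal)
    if not peaks:
        raise ValueError("track has no local maxima to scale by")
    reference = percentile(peaks, q)
    if reference == 0:
        raise ValueError("reference percentile is zero")
    return [value / reference for value in signal]


# Hypothetical binned coverage for two replicates; the second was
# sequenced twice as deeply but has the same underlying shape.
replicate_a = [0, 2, 1, 4, 1, 8, 0]
replicate_b = [0, 4, 2, 8, 2, 16, 0]
scaled_a = scale_by_maxima_percentile(replicate_a, q=100)
scaled_b = scale_by_maxima_percentile(replicate_b, q=100)
```

In this toy case both replicates scale to the same track, which shows why anchoring on the distribution of peak heights can remove a global difference in signal strength that depth alone may not capture.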

17
Diet Explains Significant Variance in Oral Microbial Community Structure

Xie, Y.; Bi, M.; Gu, W.; Li, Y.; Roccuzzo, A.; Rosier, B. T.; Tonetti, M.

2026-04-25 dentistry and oral medicine 10.64898/2026.04.24.26351661 medRxiv
Top 4%
1.0%

Diet is an important ecological modulator of the oral microbiome, yet population-level evidence on a broader spectrum of food components remains limited. This cross-sectional study investigated associations among dietary intake, oral rinse microbiome, and oral disease conditions in a nationally representative sample of United States adults from the National Health and Nutrition Examination Survey. A total of 3,254 participants with oral rinse microbiome sequencing data were included, with oral conditions classified as oral health, caries-only, periodontitis-only, or co-existing disease. Dietary intake was assessed using 24-hour dietary recalls and summarized as dietary indices and energy-adjusted food components. Associations between diet and the oral microbiome were evaluated using community-level analyses, regression models, mediation analyses, and unsupervised clustering, while accounting for oral conditions. This study found that dietary intake, as a combined variable set, explained 3.6% of the variance in oral rinse microbial community structure; this was comparable to oral disease status or smoking and larger than sociodemographic factors. Healthier dietary profiles, including higher health-associated dietary index scores and greater vegetable and fruit intake, were associated with taxa commonly linked to oral health (e.g., Neisseria, Cardiobacterium and Lautropia). In contrast, added sugars, alcoholic drinks, cured meat, potatoes, dairy products, and higher dietary inflammatory index scores showed opposite association patterns. Mediation analyses suggested that coordinated microbial groups may partly link dietary exposures with oral disease outcomes, particularly for vegetables and added sugars. Additionally, three population-level dietary patterns were identified, among which the plant-rich pattern was associated with more favorable oral health and microbial profiles enriched in nitrate-reducing commensals, including Neisseria and Haemophilus. 
Overall, dietary intake was associated with oral microbiota composition and oral health conditions, supporting ecological influences of dietary components beyond sugar on oral bacteria and dental diseases. Longitudinal studies are needed to clarify the direction and causality of these relationships.

18
Enhanced Viral Detection in Grapevine via Exome Depletion and Next-Generation Sequencing (NGS)

Cuello, R. A.; Zavallo, D.; Vera, P.; Sattler, A.; Puebla, A. F.; Debat, H. J.; Gomez Talquenca, S.; Asurmendi, S.

2026-04-20 plant biology 10.64898/2026.04.16.718969 medRxiv
Top 4%
1.0%

Grapevine (Vitis vinifera L.) is highly prone to viral infections that pose a significant threat to global viticulture sustainability. Traditional detection methods, such as PCR and ELISA, are limited to well-known pathogens, highlighting the need for more comprehensive and unbiased approaches. Here, we present the development of a cost-effective viral enrichment system adapted to next-generation sequencing (NGS) for the detection and characterization of grapevine viruses. Our strategy leverages hybridization-based capture using biotin-labeled cDNA probes (hereafter named "Chloro-Zero") designed to selectively deplete highly abundant host transcripts, particularly plastid and ribosomal RNAs, while preserving viral RNA. Probe design was informed by transcriptomic analysis of V. vinifera. We evaluated different subtractor-to-target RNA ratios, observing a consistent reduction of host RNA and a moderate enrichment of viral sequences. NGS analysis revealed improved recovery of low-abundance viral transcripts, with coverage levels comparable, to a certain extent, to those obtained using previously available commercial kits, but at a significantly lower cost. Although variability in depletion efficiency was observed, the results demonstrate the potential of this scalable and locally adaptable protocol for virome profiling in grapevines. By addressing key limitations of current depletion methods, our approach facilitates the detection of emerging viral threats and supports the development of more effective certification programs and sustainable management practices. Ongoing improvements in probe design and bioinformatic workflows are expected to enhance performance, providing a robust platform for broader applications in plant virology.

19
High-variance phenome database reveals important roles of WD40 proteins in the plant pathogenic fungus Fusarium graminearum

Choi, S.; Lee, N.; Jeon, H.; Park, J.; Kim, S.; Kim, J.-E.; Shin, J.; Moon, H.; Min, K.; Choi, Y.; Hwangbo, A.; Kim, H.; Choi, G. J.; Lee, Y.-W.; Song, D.-G.; Son, H.

2026-04-20 molecular biology 10.64898/2026.04.19.719521 medRxiv
Top 5%
0.9%

WD40 is a highly conserved protein domain in eukaryotes, playing a critical role in various cellular processes. We conducted genome-wide functional analysis of WD40 genes in Fusarium graminearum, a phytopathogenic fungus that causes severe yield loss and mycotoxin contamination in major cereal crops. Comprehensive phenome analysis of 119 WD40 gene deletion mutants across 22 distinct phenotypic traits revealed phenotypic divergence within the phenome, establishing a strong correlation between virulence and sexual reproduction. Notably, 21 "core WD40 genes" were identified, offering valuable insights into divergent biological processes. Pilot interactome studies of Fgwd101 and Fgwd133 provided further insights into their potential pathobiological functions. Our investigation contributes to broadening our knowledge of the biological mechanisms underlying fungal pathogenesis and may assist in the identification of targets for antifungal agents.

20
EcoCore; An ecologically diverse panel of Arabidopsis thaliana accessions for studying plant-environment interactions

van Eijnatten, A. L.; Keijzer, J. J.; Trenner, J.; Delker, C.; Quint, M.; Van Zanten, M.; Snoek, L. B.

2026-04-21 plant biology 10.64898/2026.04.17.719158 medRxiv
Top 5%
0.9%

Arabidopsis thaliana naturally occurs across a wide geographic range and displays extensive natural variation in several traits, including adaptive responses to the abiotic environment (e.g. temperature, drought, salt). Quantitative techniques like Genome Wide Association Studies (GWAS) enable mapping of the genetic basis of such environmental responses and benefit from extensive genetic variation, but the size of the chosen diversity panel is often limited by phenotyping capacity. Most studies therefore use subpanels, often based on maximization of genetic diversity. However, this type of selection may overrepresent cosmopolitan alleles and underrepresent rare environment-specific alleles. Here, we demonstrate that the genetic variation in a GWAS subpanel of Arabidopsis thaliana accessions depends almost entirely on the number of accessions in the panel and very little on the composition of the panel. We present the EcoCore panel, designed by grouping accessions of the 1001 Genomes (1001G; 1135 accessions) collection based on their native collection environment and selecting an equal number of accessions from each environment. We assessed hypocotyl lengths of plants grown at control and ambient high temperatures (20°C and 28°C) for 913 accessions of the 1001G and mapped these traits with the full 1001G panel versus the EcoCore panel. The EcoCore panel revealed novel genetic associations with hypocotyl length, which is attributed to enrichment of alleles from rare environments. We present the EcoCore panel as a manageable resource for studying phenotypic plasticity and the genetic basis of plant-environment interactions.
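The panel design described above, grouping accessions by collection environment and drawing an equal number from each group, amounts to stratified sampling. The sketch below illustrates that idea only; it is not the EcoCore procedure. The accession identifiers, the environment labels, and the use of seeded random sampling within each group are assumptions, since the abstract does not say how accessions are chosen within an environment.

```python
import random
from collections import defaultdict


def select_balanced_panel(accession_env, per_env, seed=0):
    """Pick up to `per_env` accessions from every environment group.

    `accession_env` maps an accession ID to its environment label.
    Random choice within a group is an illustrative assumption.
    """
    groups = defaultdict(list)
    for accession, env in accession_env.items():
        groups[env].append(accession)

    rng = random.Random(seed)  # seeded so the panel is reproducible
    panel = []
    for env in sorted(groups):
        members = sorted(groups[env])
        count = min(per_env, len(members))  # small groups contribute all they have
        panel.extend(rng.sample(members, count))
    return panel


# Hypothetical collection in which one environment dominates.
toy_collection = {
    "acc1": "temperate",
    "acc2": "temperate",
    "acc3": "temperate",
    "acc4": "temperate",
    "acc5": "cold",
    "acc6": "dry",
}
toy_panel = select_balanced_panel(toy_collection, per_env=1)
toy_panel_envs = sorted(toy_collection[a] for a in toy_panel)
```

In the toy collection the "temperate" group makes up two thirds of the accessions but only one third of the selected panel, which is how balanced selection raises the share of accessions from rare environments.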